Abstract


While hierarchical machine learning approaches have been 
used to classify texts into different content areas, this ap- 
proach has, to our knowledge, not been used in the automat- 
ed assessment of text difficulty. This study compared the 
accuracy of four classification machine learning approaches 
(flat, one-vs-one, one-vs-all, and hierarchical) using natural 
language processing features in predicting human ratings of 
text difficulty for two sets of texts. The hierarchical classifi- 
cation was the most accurate for the two text sets considered 
individually (Set A, 77.78%; Set B, 82.05%), while the
non-hierarchical approaches, one-vs-one and one-vs-all,
performed similarly to the hierarchical classification for the
combined set (71.43%). These findings suggest both prom- 
ise and limitations for applying hierarchical approaches to 
text difficulty classification. It may be beneficial to apply a 
recursive top-down approach to discriminate the subsets of 
classes that are at the top of the hierarchy and less related, 
and then further separate the classes into subsets that may 
be more similar to one another. These results also suggest that
a single approach may not always work for all types of
datasets and that it is important to evaluate which machine
learning approach and algorithm work best for particular
datasets. The authors encourage more work in this area to 
help suggest which types of algorithms work best as a func- 
tion of the type of dataset. 


Introduction 


Gaining new knowledge, in both formal and informal envi- 
ronments, relies heavily on learning from text. An im- 
portant component of the comprehension process is the 
difficulty of the text being read. Texts that are too difficult 
can impede comprehension. Educators must find texts that are
grade appropriate and may also need to select texts that
meet the needs of individual students. Given the abundance
of text materials available, educators simply do not have 
the time to find and thoroughly evaluate texts for this pur- 
pose. As such, educators depend on text difficulty formulas 
to quickly identify appropriately challenging texts. Common
readability formulas such as Flesch Reading
Ease (Flesch, 1948) that assess text difficulty are usually 
based on number of syllables per word, number of words 
per sentence, and the number of sentences (Klare, 1974). 
Though easy to use, these formulas are centered on rela- 
tively shallow lexical and sentential indices. However, 
theories of reading comprehension suggest that deep fea- 
tures related to syntax and semantics drive text difficulty 
(Dufty et al., 2006; Duran et al., 2007; McNamara, 
Graesser, and Louwerse, 2012). To address this issue, re- 
searchers have developed natural language processing 
(NLP) tools that extract richer information about the lin- 
guistic features of a text that reflect complex dimensions 
such as narrativity, syntactic complexity, and cohesion 
(e.g., Crossley et al., 2016; McNamara et al., 2014).

Researchers have also begun to employ machine- 
learning approaches for measuring text readability (Col- 
lins-Thompson, 2014; Kate et al., 2010; Kotani, Yoshimi, 
and Isahara, 2011; Pilan, Volodina, and Johansson, 2014). 
These approaches have shown promise in more accurately 
assessing text difficulty as compared to “classic” readability
approaches (François and Miltsakaki, 2012).

Though promising, most of this work has focused on ei- 
ther determining the best set of linguistic features or com- 
paring regression and classification approaches (e.g., 
François and Miltsakaki, 2012; Heilman et al., 2008). To
our knowledge, there is little work investigating the poten- 
tial for hierarchical approaches in the classification of text 
difficulty. Hierarchical classification has been used in a 
number of areas such as protein classification (Zimek et 
al., 2008), essay scoring (McNamara et al., 2015), and
automatic target recognition (Casasent and Wang, 2005). 
This study addresses this gap in the literature by combining 
NLP and machine learning to compare multiple types of 
classification in their accuracy of classifying text difficulty. 

We first provide a brief description of the relevant NLP
tools and machine learning techniques and then present 
results of the experiments. 


Natural Language Processing 


NLP intersects computational linguistics, computer science,
and artificial intelligence to understand, assess, and re- 
spond to naturally occurring human language. NLP has 
been used in education to support student learning, for in- 
telligent and automatic assessments, to improve learning 
and teaching in massive open online courses (MOOCs), 
and to develop learning systems. In this study, we em- 
ployed the NLP tool, Coh-Metrix (McNamara et al., 2014), 
which integrates a number of sophisticated tools such as 
advanced syntactic parsers, part-of-speech taggers, distri- 
butional models, and psycholinguistic databases (Coltheart, 
1981) to generate over 400 indices of language, text, and 
readability. 


Machine Learning 


Machine learning algorithms are categorized as unsuper- 
vised and supervised. Unsupervised learning uses data that 
is not labeled, whereas in supervised machine learning, the 
algorithms are trained on labeled data. For supervised algo- 
rithms, regression is used to predict quantitative variables, 
whereas classification is used to predict qualitative varia- 
bles (Hastie, Tibshirani, and Friedman, 2009; James et al., 
2013). As our data involves human ratings of categories 
(i.e. labeled categorical data), we adopted a supervised 
learning classification approach. 

Commonly used classification algorithms include Deci- 
sion Trees, Naive Bayes, Linear Discriminant Analysis 
(LDA), Support Vector Machines (SVM), Logistic Regres- 
sion, Random Forests, Neural Networks, and Boosting (for 
further description of these algorithms, see Balyan, McCar- 
thy, and McNamara, 2017; Hastie et al., 2009). Preliminary 
experiments with a number of these algorithms indicated 
that SVM and LDA were the most accurate for the current 
data. 


Non-Hierarchical Classification 


Three types of non-hierarchical approaches were used in 
this study. Flat classification is the simplest and the most 
direct approach to category classification. It uses a single
classifier trained on all classes in the training
dataset. The other two non-hierarchical approaches,
one-vs-one and one-vs-all, use SVM, which relies on the
construction of multiple hyperplanes. SVMs are typically
designed for binary classification (Duda, Hart, and Stork,
2000). However, one-vs-one and one-vs-all are the two
most popular approaches for extending SVMs to classify
K classes (K > 2). The one-vs-one classification approach
forms a binary classifier for each class pair, and hence
K(K-1)/2 classifiers are required. In contrast, one-vs-all
classification compares each class to all other classes, and
hence K classifiers are required (Duda et al., 2000; James
et al., 2013).
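As a rough illustration of the classifier counts described above, the following sketch enumerates the binary problems that each scheme would need to train. The function names and class labels are hypothetical and chosen only for illustration; no classifier is actually fitted here.

```python
from itertools import combinations


def one_vs_one_tasks(classes):
    """Return every unordered class pair; one binary classifier per pair."""
    return list(combinations(classes, 2))


def one_vs_all_tasks(classes):
    """Return (target, rest) splits; one binary classifier per class."""
    return [(c, [o for o in classes if o != c]) for c in classes]


if __name__ == "__main__":
    levels = ["low", "middle", "high", "very high"]  # illustrative labels
    print(len(one_vs_one_tasks(levels)))  # number of pairs, K(K-1)/2
    print(len(one_vs_all_tasks(levels)))  # number of splits, K
```

For three classes both schemes need three classifiers; the counts only diverge as K grows.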




Hierarchical Classification 


The one-vs-all approach does not consider similarity between
the classes (Casasent and Wang, 2005). Additionally, the
one-vs-one approach is not attractive when K is large
(Kumar et al., 2002). A preferable approach is binary
hierarchical classification (Wang and Casasent, 2009).
This approach is based on the divide-and-conquer strategy
and learns concepts more effectively and efficiently (Kumar
and Ghosh, 1999). In some cases, the hierarchical
classifier has performed better than some single complex
classifiers, such as neural networks (NNs) and kNN
classifiers (Kumar, Ghosh, and Crawford, 2002; Schwenker,
2000). In binary hierarchical classification, the classes are
divided into two smaller macro-classes at each node. Only
log2(K) classifiers need to be traversed in order to move
from the top to a bottom decision node.
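The traversal cost mentioned above can be made concrete with a small calculation. This is a minimal sketch that assumes a balanced binary hierarchy (an assumption, since the hierarchies used in practice may be unbalanced); the function names are hypothetical.

```python
import math


def max_decisions_balanced(k):
    """Classifiers traversed from root to leaf in a balanced binary tree
    with k leaf classes (rounded up when k is not a power of two)."""
    if k < 2:
        return 0
    return math.ceil(math.log2(k))


def internal_nodes(k):
    """Binary classifiers trained in any binary tree with k leaf classes."""
    return max(k - 1, 0)
```

Under this assumption, a four-class problem trains three node classifiers but consults only two of them for any single text.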


Current Study 


This study leverages NLP to compare machine learning 
classification approaches in their accuracy of classifying 
human ratings of text difficulty. We conducted these
experiments in the context of text development for our reading
comprehension intelligent tutor, Interactive Strategy Training
for Active Reading and Thinking (iSTART; McNamara,
Levinstein, and Boonthum, 2004). This study is not only
theoretically interesting, but also has important applica- 
tions. By using accurate NLP tools and machine learning 
algorithms, we can automate the process of classifying new 
texts within the iSTART text library. If successful, the cur- 
rent approach will permit researchers and teachers to add 
their own texts into the text library ensuring that the diffi- 
culty levels assigned in the system remain consistent over 
time. 

We used two text sets, both individually and combined, 
to compare not only different classifiers, but also four dif- 
ferent approaches. The first set of experiments identified 
the most accurate classification algorithms (SVM and 
LDA) and the second set of experiments established which 
approach (flat, one-vs-one, one-vs-all, and hierarchical) 
was most effective for classifying human ratings of text 
difficulty. 


Method 
Corpus 


The text corpus was comprised of two text sets developed 
for iSTART, an intelligent tutoring system (ITS) that sup- 
ports successful reading comprehension of complex infor- 
mational texts through self-explanation training (McNama- 
ra, Levinstein, and Boonthum, 2004; Snow et al., 2016). 
Set A was comprised of texts from the iSTART 
StairStepper module (n = 162), including expository texts
ranging in topic across science, history, pop culture, and
sports. These texts were collected from reading
comprehension tests culled from publicly available websites
(Perret et al., 2017). Set B was comprised of texts from
iSTART’s main text library (n = 100), which are used for 
self-explanation practice and within various games. These 
texts are complex, informational texts about scientific phe- 
nomena compiled when developing the practice modules 
within iSTART (Jackson and McNamara, 2013). The texts 
were culled from various sources, primarily science text- 
books. Whereas Set A included texts that varied widely in 
genre and difficulty (i.e., grades 1-12), Set B comprised 
texts used to provide information typical in high school 
and college science courses. Hence, the two sets of texts 
were quite different in nature. 

The texts in the two sets were rated for difficulty
separately, following the same procedure (see Johnson et
al., 2017). Set A was sorted into 12 levels of difficulty and
Set B was sorted into 9 levels. We compared the levels 
across the two sets and determined that the easiest texts in 
Set B were of equal difficulty to Set A texts rated as diffi- 
culty level 6. Thus, combining Set A (1-12) and Set B (6- 
14) texts resulted in a corpus that included 262 texts cate- 
gorized into 14 difficulty levels. 

Initial machine learning experiments that considered all 
text difficulty levels (1-12 for Set A and 6-14 for Set B) 
resulted in low accuracy: 25.97% to 33.95% for Set A and 
29.17% to 35.42% for Set B. The accuracy decreased fur- 
ther when the two sets were combined: ranging from 
19.44% to 26.39%. Consequently, we clustered the 14 lev- 
els into more coarse-grained levels. The researchers re-read 
the texts in each difficulty level to identify intuitive breaks 
in the text set. This resulted in four difficulty levels: low 
(1-4), middle (5-8), high (9-12), and very high (13 and 14). 
Set A included low, middle, and high difficulty texts, while 
Set B included middle, high, and very high difficulty texts. 
These four levels are roughly aligned with levels of school- 
ing in the United States: elementary (1-4), middle school 
(5-8), high school (9-12), and college-appropriate (13 and 
14). 
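The coarse-grained grouping described above amounts to a simple lookup from the 14 fine-grained levels to four bands. A minimal sketch, with a hypothetical function name:

```python
def difficulty_band(level):
    """Map a fine-grained difficulty level (1-14) to one of four bands."""
    if not 1 <= level <= 14:
        raise ValueError(f"level must be in 1..14, got {level}")
    if level <= 4:
        return "low"
    if level <= 8:
        return "middle"
    if level <= 12:
        return "high"
    return "very high"
```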


Selection of Linguistic Indices 


We selected 11 linguistic indices related to lexical
sophistication, readability, lexical diversity, and syntactic
complexity that have been shown to correlate with text
difficulty (e.g., Crossley, Allen, and McNamara, 2012; Salsbury,
Crossley, and McNamara, 2011). Removing highly correlated
indices (Pearson’s r > .85) reduced this set to eight
indices. The following sections provide brief descriptions of
each of the indices used for the experiments.
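One way such a correlation filter could be carried out is sketched below. The greedy keep-first order and the use of the absolute value of r are assumptions made for illustration; the text does not state which member of a correlated pair was dropped. The function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences.
    Assumes neither sequence is constant (non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(features, threshold=0.85):
    """Greedily keep each feature whose |r| with every feature kept so far
    does not exceed the threshold. `features` maps name -> list of values."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson_r(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept
```

With this ordering, the first feature in a highly correlated pair is retained and the later one is discarded.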


Flesch-Kincaid Grade Level 
Flesch-Kincaid Grade Level (FKGL; Kincaid et al., 1975) 
is a simple measure of readability computed using average 
syllables per word (ASW) and average sentence length
(ASL). Lower grade levels correspond to easier texts. 
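For reference, the commonly published form of the FKGL formula can be sketched as follows. The coefficients are the standard ones from Kincaid et al. (1975); how words, sentences, and syllables are counted is left to the caller and may differ from the tool used in this study.

```python
def fkgl(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid Grade Level from raw counts."""
    asl = total_words / total_sentences   # average sentence length
    asw = total_syllables / total_words   # average syllables per word
    return 0.39 * asl + 11.8 * asw - 15.59
```

Longer sentences and longer words both raise the estimated grade level.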


L2 Readability 

The L2 readability score predicts the readability of texts for
second-language learners (Crossley, Allen, and McNamara,
2012). It considers content word
overlap, sentence syntactic similarity, and word frequency.
In contrast to FKGL, higher scores indicate easier texts. 


Syntactic Complexity 

The syntactic complexity of a sentence is determined by
considering the mean number of words before the main verb and
the number of higher-level constituents per word in the
sentence. Sentences with less syntactic complexity are
easier to process and comprehend (Crossley, Allen, and
McNamara, 2012; Perfetti, Landi, and Oakhill, 2005). 


Uncommon or Rare Words 

Uncommon or rare word indices reflect how frequently
a word occurs in the English language. More
uncommon or rare words in a text render the text more
difficult. Text difficulty is expected to increase if there
are words that readers have never or rarely encountered.
This index is computed from CELEX (Baayen, Piepenbrock,
and Gulikers, 1995), a corpus of 17.9 million words.


Lexical Diversity 

Lexical diversity, or the variety of words used in a text, is 
often measured using type-token ratios (TTR). However, 
this measure is highly correlated with text length. In order to
assess lexical diversity regardless of text length, we
employed MTLD (measure of textual lexical diversity;
McCarthy, 2005) and D values (Malvern et al., 2004;
McNamara, Crossley, and Roscoe, 2013).
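The length sensitivity of TTR is easy to see in a small sketch. This shows only the plain type-token ratio, not MTLD or D, and tokenization by whitespace is a simplifying assumption.

```python
def type_token_ratio(tokens):
    """Number of distinct tokens divided by total tokens."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    return len(set(tokens)) / len(tokens)


if __name__ == "__main__":
    short = "the cat sat on the mat".split()
    print(type_token_ratio(short))      # ratio for the short text
    print(type_token_ratio(short * 2))  # same vocabulary, twice the length
```

Doubling a text without adding new vocabulary halves its TTR, which is why length-robust measures are preferred.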


Word Familiarity 

Sentences that contain words that are more familiar are 
processed more quickly (McNamara et al., 2013). Word 
familiarity is a human rating of how easily adults recognize 
a word. For example, the word ‘dog’ has a higher average 
familiarity than ‘cortex’. Word familiarity ratings are com- 
puted using the MRC Psycholinguistic Database, which 
provides ratings for several thousand words along several 
psychological dimensions. 

Word Imagability 

Word imagability refers to the ease with which one can
construct a mental image of a word. For example,
‘airplane’ and ‘hammer’ are more imagable than
‘dogma’ and ‘quantum’.


Table 1: Means and ANOVA Results for the selected linguistic features

                                           Set A                                   Set B                                        A+B
Feature                         Low    Middle      High  F(2,159)    Middle      High Very High   F(2,97)       Low    Middle      High Very High  F(3,258)
FKGL                           5.87      8.83     10.83    113.40      9.17      9.25     11.56     17.03      5.87      8.93     10.08     11.56     76.78
L2 Readability                20.64     14.41     12.46     46.17     18.12     17.25     14.34      4.35     20.64     15.47     14.72     14.34     16.80
Syntactic Complexity           0.73      0.68      0.69     13.18      0.72      0.71      0.68     15.71      0.73      0.70      0.70      0.68     11.53
Uncommon/rare Words           49.52     94.87    107.41     21.37     57.92     62.42     81.91      8.39     49.52     84.25     86.15     81.91      8.64
Lexical Diversity (MTLD)      66.41     78.46     81.06      9.69     56.41     57.32     57.15      0.03     66.41     72.12     69.84     57.15      3.92
Age of Acquisition (AoA)       5.13      5.53      5.82    128.40      5.69      5.97      6.39     25.07      5.13      5.57      5.89      6.39    130.00
Imagability                  355.11    347.74    336.18     22.46    333.54    329.06    317.99      5.78    355.11    343.66    332.81    317.99     36.58
Familiarity                  588.88    587.25    585.71      4.53    588.29    588.51    588.21      0.04    588.88    587.55    587.03    588.21      1.45

Age of Acquisition 


Age of Acquisition is an MRC Psycholinguistic Database 
index that refers to the age at which a word first appears in 
a child’s vocabulary. 


A series of analyses of variance (ANOVAs) was conducted
to determine whether each linguistic feature differed as a
function of the human ratings of text difficulty for the three
different datasets (Set A, Set B, and the combined set,
referred to as A+B). Means and ANOVA results are shown in
Table 1.

In the experiments, the omnibus ANOVAs indicated a 
significant effect of text difficulty level for all indices ex- 
cept lexical diversity for Set B and familiarity for Sets B 
and A+B. Post hoc tests revealed that these effects were 
driven by differences between the low level compared to 
the high and very high difficulty levels. Few of the indices 
showed differences between the middle and high levels. 
These tests as a whole, however, confirmed that the lin- 
guistic features vary significantly across text difficulty 
levels. Consequently, we used these linguistic features in 
the subsequent machine learning experiments. 
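For readers unfamiliar with the omnibus test, the one-way ANOVA F statistic can be sketched as below. The function name is hypothetical and the sketch computes only F and its degrees of freedom, not the p-value. With three difficulty levels and 162 texts, the degrees of freedom come out as (2, 159), matching the F(2,159) column reported for Set A.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group variability: group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Within-group variability: observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```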


Results 


Non-Hierarchical Classification 


We used ten-fold cross validation to assess the accuracy of
the models. The accuracies of the non-hierarchical
classification approaches are shown in Table 2. The highest
accuracy for the flat classification approach was for Set A
(76.19%) and the lowest accuracy was for the A+B dataset
(59.47%). The LDA classifier achieved the highest accuracy for
all the datasets for this approach.
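A generic ten-fold split can be sketched as follows. Whether the folds in this study were shuffled or stratified by difficulty level is not stated, so the shuffling and the seed below are illustrative assumptions, and the function name is hypothetical.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Return a list of (train, test) index lists for k-fold validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)       # seed is an arbitrary choice
    folds = [indices[i::k] for i in range(k)]  # near-equal fold sizes
    splits = []
    for i, test in enumerate(folds):
        # Training set is every fold except the held-out one.
        train = [j for f_no, fold in enumerate(folds) if f_no != i
                 for j in fold]
        splits.append((train, test))
    return splits
```

Each item appears in exactly one test fold, so every text contributes once to the reported accuracy.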


Because Set A and Set B each contained three text diffi- 
culty levels (classes), we constructed three classifiers for 
the one-vs-one and one-vs-all classification tasks. The 
A+B dataset contained four levels, so we constructed six 
classifiers for the one-vs-one approach and four classifiers 
for the one-vs-all approach. SVM was used to train the
models. Asterisks in Table 2 indicate the highest accuracy
for each dataset.


Table 2: Accuracy (%) for non-hierarchical classification

                                   Data Source
Approach      Classifier       Set A    Set B      A+B
Flat          LDA              76.19*   71.79    59.47
one-vs-one    SVM              76.19*   71.79    71.43*
one-vs-all    SVM              74.60    74.36*   71.43*

* Highest Accuracy


For Set A, the flat and one-vs-one approaches were the
most accurate. In contrast, for Set B, the one-vs-all approach
was the most accurate. The one-vs-one and one-vs-all approaches
achieved the same accuracy for the A+B dataset.


Hierarchical Classification 


For hierarchical classification, we conducted three experi- 
ments with multiple runs. For example, in the first run of 
Set A, we first classified texts into two classes as ‘low’ and 
‘other’. At the second level, the ‘other’ class was further 
classified into ‘middle’ and ‘high’. For the second run, the 
texts were first classified as ‘middle’ and ‘other’ and then 
the ‘other’ texts were further classified as ‘low’ and ‘high’. 
Finally, for the third run, the texts were first classified as
‘high’ and ‘other’ and then the ‘other’ texts were
reclassified as ‘low’ and ‘middle’. A summary of these
experiments for different data combinations is provided in
Table 3.


Table 3: Hierarchical Classification Experiments Summary

Experiment    Set A      Set B       A+B
Run 1         L+(M/H)    M+(H/VH)    (L/M)+(H/VH)
Run 2         M+(L/H)    H+(M/VH)    (L/H)+(M/VH)
Run 3         H+(L/M)    VH+(M/H)    (L/VH)+(M/H)

L: Low, M: Middle, H: High, VH: Very High
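The two-level structure of a three-class run such as L+(M/H) can be sketched as below. This is only an illustration: a nearest-centroid rule on a single numeric feature stands in for the LDA and SVM classifiers used in the study, and all names and toy values are hypothetical.

```python
class CentroidBinary:
    """Stand-in classifier: predicts the label whose mean is nearest."""

    def fit(self, xs, ys):
        self.means = {}
        for label in set(ys):
            vals = [x for x, y in zip(xs, ys) if y == label]
            self.means[label] = sum(vals) / len(vals)
        return self

    def predict(self, x):
        return min(self.means, key=lambda lab: abs(x - self.means[lab]))


def fit_two_level(xs, ys, first):
    """Split `first` vs 'other' at the top node, then separate the
    remaining classes at the second node."""
    top = CentroidBinary().fit(
        xs, [y if y == first else "other" for y in ys])
    rest = [(x, y) for x, y in zip(xs, ys) if y != first]
    bottom = CentroidBinary().fit(
        [x for x, _ in rest], [y for _, y in rest])

    def predict(x):
        label = top.predict(x)
        return label if label == first else bottom.predict(x)

    return predict


if __name__ == "__main__":
    # Invented one-feature toy data, not values from the corpus.
    xs = [5.0, 6.0, 9.0, 10.0, 12.0, 13.0]
    ys = ["low", "low", "middle", "middle", "high", "high"]
    predict = fit_two_level(xs, ys, first="low")
    print(predict(9.2))
```

Changing `first` reproduces the other run orderings, which is how the choice of top-level split can be compared.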


The classification accuracy of the final model for all
three experiments is summarized in Table 4. We observed
that hierarchical classification (Run 1) improved the
accuracy of the model for Set B and the A+B dataset
substantially compared with flat classification (Set B:
82.05% vs. 71.79%; A+B: 71.43% vs. 59.47%). In contrast, there
was only a slight improvement for Set A (Run 3; 77.78%) over
the previous approaches (76.19%).


Table 4: Accuracy (%) for Hierarchical Classification

                                   Data Source
Experiment    Classifier       Set A    Set B      A+B
Run 1         LDA and SVM      74.60    82.05*   71.43*
Run 2         LDA and SVM      76.19    58.97    66.23
Run 3         LDA and SVM      77.78*   57.89    70.13

* Highest accuracy


In sum, we found that hierarchical classification 
achieved the highest accuracy for Set A (77.78%) and Set 
B (82.05%). For the A+B dataset, we achieved the highest
accuracy (71.43%) using both the hierarchical approach and the
two non-hierarchical approaches, one-vs-one and
one-vs-all.


Discussion 


This study compared different supervised machine learning 
approaches in classifying human ratings of text difficulty.
Our approach was novel in that we submitted the texts to 
multiple types of machine learning approaches: flat, one- 
vs-one, one-vs-all, and hierarchical. These experiments 
demonstrated the potential of using hierarchical approaches 
in text difficulty classification, but also indicated that no 
one single approach was most accurate. 

When classifying Set A and Set B text sets independent- 
ly, the most accurate approach was hierarchical classifica- 
tion. However, when the two text sets were combined, one- 
vs-one, one-vs-all, and hierarchical approaches performed 
similarly. The differences in the accuracy of these ap- 
proaches suggest that there are potential differences in the 
nature of the texts in each text set. As seen in Table 1, the
differences between the middle and high text sets were 
lessened when the sets were combined. Set B texts are sci- 
entific texts appropriate for high school and college stu- 
dents, whereas Set A texts were designed to include a 
broader range of reading skills and topics. Given that the 
two sets were developed for different purposes and rated 
independently of one another, it was expected that they 
would not be perfectly comparable. 

At an applied level, we plan to implement separate algo- 
rithms to classify text difficulty depending on whether the 
text will be included in Set A or in Set B within iSTART. 
This automated classification allows us to continue to per- 
mit teachers to add their own texts to the system, while 
providing an adaptive environment in which students are 
presented with skill-level appropriate readings. 